Complex Document Information Processing: Towards Software and Test Collections
Abstract
Complex document information processing (CDIP) is the analysis of documents that combine handwritten notes, diagrams, and graphics with printed and formatted text. Rapid analysis of collections of complex documents is increasingly critical to intelligence appraisals of hostile forces and other entities of interest. Such collections may be seized by military or law enforcement (recently in Iraq, Afghanistan, and elsewhere), or turned over by cooperating parties (e.g., defectors). This material must be scanned or otherwise input, understood, and exploited within a short period of time if it is to be of maximum value. Such collections may contain many types of complex documents and data, and ideally should be interpreted in the context of previously obtained collections and knowledge.
Unfortunately, most current tools are specialized to one kind of data (text content or style, handwriting, graphics, database records, etc.) and sometimes to only one genre of document. Extracting intelligence from such collections therefore relies heavily on manual partitioning of data, coordination of analysis tools, and collation of results. Besides being slow, expensive, and demanding on the time of trained analysts, the current approach has two fundamental problems. First, the speed and accuracy of analyzing one data type could often be improved if partial analyses from other data types were available, but a collection of standalone tools does not support this. Second, manual collation, cross-checking, and consistency checking are slow and error-prone when dealing with large numbers of documents, further reducing the value of the information obtained. The operational problem, therefore, is the dearth of unified tools for CDIP in both commercial and research spheres. That in itself, however, is symptomatic of a second problem: the lack of standard test collections for CDIP research.
Test collections of documents (with manual judgments and annotations) in text retrieval and categorization, handwriting recognition, and data mining have sparked intense research and commercial development. By contrast, CDIP research and development is stifled by the lack of a test collection of complex documents. Creating a CDIP test collection is thus an essential component of any serious CDIP effort.
Similar resources
Educe: Enhanced Digital Unwrapping for Conservation and Exploration of Inaccessible Texts
Analysis of large collections of complex documents is increasingly critical to forming accurate and precise intelligence appraisals of enemy activity. Complex documents include handwritten notes, diagrams, and graphics in addition to printed and formatted text. Collections may comprise many different types of complex documents and data, hence extracting useful intelligence from such collections...
Content-based document image retrieval in complex document collections
We address the problem of content-based image retrieval in the context of complex document images. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting); and...
Constructing Japanese test collections for spoken term detection
Spoken Document Retrieval (SDR) and Spoken Term Detection (STD) have been two of the most intensively investigated topics in spoken document processing research since the establishment of the SDR and STD test collections by the Text REtrieval Conference (TREC) and NIST. Because Japanese spoken document processing researchers also require such test collections for SDR and STD, we have es...
A new text-mining method for extracting user context information to improve the ranking of search engine results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual and documentary material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...
Semantically-Based Active Document Collection Templates for Web Information Management Systems
Representing and processing semantic information regarding individual documents is desirable but not sufficient. To improve the efficiency and reusability of users’ work with Web-based information management systems, it is essential to handle document collections. We describe techniques for representing semantics both of collections and of information management services that operate upon them....